Nathan Lally: Data Scientist @ New England Statistical Society & HSB (Munich Re)
2019-04-21
Sr. Machine Learning Modeler
I build statistical and machine learning models to predict the price (premium) we should charge consumers for insurance products.
So let me take you on a journey through the world of insurance product pricing. Excitement abounds at every turn!
Most products or services are fairly easy to price.
\[ \begin{align} \text{Price} &= \text{Expenses} + \text{Desired Profit} \end{align} \]
This is a somewhat simplified model, but expenses (material, manufacturing, distribution, marketing, etc.) are typically fixed or reasonably easy to estimate.
What makes insurance products different?
Insurance products are not easy to price.
\[ \begin{align} \text{Price} &= \text{Expenses} + \text{Desired Profit} \end{align} \]
where,
\[ \begin{align} \text{Expenses} &= \text{Loss Cost} + \text{Fixed Expenses} \end{align} \]
Loss cost is the amount of money an insurer will pay out for claims incurred by an insured over a given policy period. However, we do not know an insured's loss cost at the point of sale; it is a random quantity. To price insurance, we must estimate (predict) an insured's loss cost. We call this estimate the expected loss cost, or sometimes the pure premium (premium before expense and profit loading).
It turns out loss cost itself has two components, each of which is a random quantity.
\[ \begin{align} \text{Loss Cost} &= \text{Claims Frequency} \cdot \text{Claims Severity} \end{align} \]
In the days before digital computing and before many modern advances in statistics, simple methods were used to predict loss cost. Policyholder claims data would be aggregated by several explanatory/predictor variables into what are known as “ratings cells” and very basic statistics would be calculated to estimate loss cost.
\[ \begin{align} \text{Claims Frequency} &= \frac{\text{Claim Count}}{\text{Exposure}}\\ \text{Claims Severity} &= \frac{\text{Claim Cost}}{\text{Claim Count}}\\ \end{align} \]
The next slide shows an example of this methodology. The data used throughout this presentation is publicly available and comes from a major French auto insurer in 2004.
Is there anything suspect with this methodology?
| Gender | VehUsage | binAge | claim_count | claim_cost | exposure | severity | frequency | loss_cost |
|---|---|---|---|---|---|---|---|---|
| Male | Professional | [17.9,25.9] | 7 | 16523.17 | 13.367 | 2360.453 | 0.5236777 | 1236.1165 |
| Male | Professional | (25.9,33.8] | 104 | 187051.78 | 315.085 | 1798.575 | 0.3300697 | 593.6550 |
| Male | Professional | (33.8,41.7] | 115 | 278157.78 | 430.341 | 2418.763 | 0.2672299 | 646.3660 |
| Male | Professional | (41.7,49.6] | 135 | 309184.28 | 496.930 | 2290.254 | 0.2716680 | 622.1888 |
| Male | Professional | (49.6,57.5] | 147 | 275057.24 | 596.273 | 1871.138 | 0.2465314 | 461.2941 |
| Male | Professional | (57.5,65.4] | 81 | 156997.70 | 275.397 | 1938.243 | 0.2941209 | 570.0778 |
| Male | Professional | (65.4,73.3] | 20 | 46324.39 | 47.924 | 2316.219 | 0.4173274 | 966.6219 |
| Male | Professional | (73.3,81.2] | 8 | 24286.42 | 20.089 | 3035.802 | 0.3982279 | 1208.9412 |
| Male | Professional | (81.2,89.1] | 0 | 0.00 | 3.000 | 0.000 | 0.0000000 | 0.0000 |
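The arithmetic behind a ratings cell is straightforward. The short Python sketch below reproduces the first cell of the table above (Male / Professional / [17.9,25.9]) from its raw aggregates:

```python
# Raw aggregates for the first ratings cell in the table above
claim_count = 7
claim_cost = 16523.17   # total cost of claims in the cell
exposure = 13.367       # total policy-years of exposure in the cell

frequency = claim_count / exposure   # claims per unit of exposure
severity = claim_cost / claim_count  # average cost per claim
loss_cost = frequency * severity     # equivalently, claim_cost / exposure

print(frequency, severity, loss_cost)
```

The printed values match the `frequency`, `severity`, and `loss_cost` columns of the table.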
It turns out that choosing meaningful rating cells and estimating their associated loss costs is more of an art than a science. Actuaries would need to turn to intuition and assumptions to choose the variables that define ratings cells, and to adjust values that did not seem reasonable, especially loss cost estimates in cells where exposure is very limited.
Fortunately for the insurance industry, statisticians continued to develop useful models and methods throughout the 20th century (they are still at it, trust me). Actuaries and other insurance professionals begrudgingly began to use these models for product pricing, generally a decade or two after their introduction.
The image below depicts an actuary fighting with her managers after being told to use R for statistical modeling rather than continuing to create tables in Excel.
Just kidding. It is a stock photo from the BLS Occupational Outlook Handbook site on actuarial careers.
One such statistical model is the generalized linear model (GLM). GLMs were developed in the late 1970s and became popular in insurance pricing applications in the 1990s. GLMs and their extensions are still used to this day in insurance pricing.
\[ \begin{align} \mathbb{E}[Y|\pmb{x}] &= g^{-1}\left(\beta_0 + \pmb{x}'\pmb{\beta}\right) \end{align} \]
The outcome or dependent variable \( Y \) is assumed to be generated from a distribution in the exponential family (more on that in a bit), the row vector \( \pmb{x} \) encodes information from a set of predictor variables, \( \beta_0 \) is called the model intercept, and the vector \( \pmb{\beta} \) represents the regression weights associated with the predictor variables.
Let \( N \) be a random variable representing claims counts and \( Z \) be a random variable representing claims costs. A popular (and generally useful) assumption in insurance pricing is that realizations of \( N \) are generated by a Poisson distribution and \( Z \) a gamma distribution. \[ \begin{align} f(n) &= \lambda^{n}\frac{e^{-\lambda}}{n!} \ \text{for } n \ge 0\\ f(z) &= \frac{\beta^\alpha}{\Gamma(\alpha)}z^{\alpha-1}e^{-\beta z}\ \text{for } z > 0\\ \end{align} \]
For years, the most popular way to model loss cost was to model claims frequency and claims severity separately, with two distinct models. The claims frequency model would be fit with data from all available policies, while the claims severity model would be fit only to the data where claims had occurred.
Poisson GLM \[ \begin{equation} \lambda_i = e^{\left( \alpha_0 + \pmb{x}_i'\pmb{\alpha} + \log(c_i)\right)} \end{equation} \]
Gamma GLM \[ \begin{equation} \theta_i = e^{\left( \beta_0 + \pmb{x}_i'\pmb{\beta}\right)} \end{equation} \]
Loss Cost \[ \begin{equation} \mu_i = \lambda_i \theta_i \end{equation} \]
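The frequency–severity composition can be sketched in a few lines of Python. The coefficient values below are invented purely for illustration; in practice they would come from fitting the two GLMs to claims data.

```python
import numpy as np

# Hypothetical fitted coefficients (illustrative only, not from real data)
alpha0, alpha = -2.0, np.array([0.3, -0.1])   # Poisson frequency model
beta0, beta = 7.0, np.array([0.2, 0.05])      # gamma severity model

x = np.array([1.0, 2.0])  # predictor vector for one policy
c = 0.5                   # exposure (half a policy-year), enters as an offset

lam = np.exp(alpha0 + x @ alpha + np.log(c))  # expected claim count
theta = np.exp(beta0 + x @ beta)              # expected cost per claim
mu = lam * theta                              # predicted loss cost
print(mu)
```

Because both models use a log link, the product collapses to a single exponential of the combined linear predictors, scaled by exposure.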
As I said, we are fortunate statisticians don't stop thinking. There has to be some distribution out there that can model loss cost directly rather than requiring two sub-models. In fact, such distributions have been discovered. Perhaps the most appropriate for this application is the compound Poisson-gamma distribution.
\[ \begin{align} N &\sim \text{Poisson}(\lambda)\\ Z &\sim \Gamma(\alpha, \beta)\\ Y &= \sum_{i=1}^N Z_i \end{align} \]
This sum follows a special case of the exponential dispersion models known as the Tweedie distribution (with power parameter \( 1 < p < 2 \)). Believe it or not, something this dry revolutionized insurance pricing.
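A quick simulation (parameter values are arbitrary, chosen only for illustration) shows the compound construction in action and checks the textbook identity \( \mathbb{E}[Y] = \lambda \cdot \alpha/\beta \):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
lam, alpha, beta = 2.0, 2.0, 0.001  # arbitrary illustrative parameters

counts = rng.poisson(lam, n)                      # N_i ~ Poisson(lam)
# Draw every individual claim severity at once, then sum within each policy
sev = rng.gamma(alpha, 1.0 / beta, counts.sum())  # Z ~ Gamma(alpha, rate beta)
y = np.bincount(np.repeat(np.arange(n), counts), weights=sev, minlength=n)

print(y.mean())         # close to lam * alpha / beta = 4000
print((y == 0).mean())  # point mass at zero: exp(-lam) ~ 0.135
```

Note the point mass at zero: policies with no claims have a loss cost of exactly zero, which is precisely why an ordinary gamma model cannot fit loss cost directly.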
OK, let's translate all that mess into something straightforward. For the Tweedie GLM,
\[ \begin{align} \mu_i &= e^{\beta_0 + \pmb{x}_i'\pmb{\beta} + \log(c_i)} = e^{\beta_0 + \pmb{x}_i'\pmb{\beta}}\cdot c_i \end{align} \]
\[ \begin{align} \text{base loss cost} &= e^{\beta_0} \end{align} \]
\[ \begin{align} \text{adjustment factor} &= e^{\pmb{x}_i'\pmb{\beta}} \end{align} \]
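In other words, a Tweedie GLM with a log link yields a multiplicative rating plan: a base loss cost scaled up or down by per-policy adjustment factors. A tiny sketch, with coefficients again invented for illustration:

```python
import numpy as np

beta0 = 6.0                      # hypothetical intercept
beta = np.array([0.25, -0.40])   # hypothetical regression weights
x = np.array([1.0, 0.0])         # e.g. indicator features for one policy
c = 1.0                          # one full policy-year of exposure

base_loss_cost = np.exp(beta0)        # loss cost for the "baseline" policy
adjustment = np.exp(x @ beta)         # multiplicative adjustment factor
mu = base_loss_cost * adjustment * c  # predicted loss cost
print(mu)

# The multiplicative form and the linear-predictor form agree exactly
assert np.isclose(mu, np.exp(beta0 + x @ beta + np.log(c)))
```

This multiplicative structure is why ratings plans built from log-link GLMs are easy to read: each rating variable contributes one factor.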
It's even easier to understand with pictures though. The next several slides illustrate the results of a Tweedie GLM fit to the French auto claims data.
The Tweedie GLM is still a common model used to predict loss cost in the insurance industry, and it can be used to produce ratings plans that are easy to interpret. However, it is not without its limitations, including (but not limited to) the need to specify predictor variables, transformations, and interactions by hand. When dealing with potentially thousands of variables, this can be quite cumbersome.
Statistical learning is a branch of (some would argue synonym for) machine learning (ML) that uses statistical theory and algorithms to automatically discover patterns and relationships in data. After learning from observed data, statistical learning models can make predictions about future events without explicit instructions from a human programmer.
Machine learning can be viewed as a subset of artificial intelligence (AI).
To predict insurance loss costs we use what are called supervised learning algorithms. Supervised learning methods attempt to learn functions that map input information to outputs.
\[ \begin{align} y_i &= f(\pmb{x}_i) + \epsilon_i \end{align} \]
In our example we estimate a function that maps predictor variable values to expected auto insurance loss cost.
\[ \begin{align} \hat{y_i} &= \widehat{f}(\text{License Age}_i,...,\text{Max Speed}_i) \end{align} \]
We will use gradient boosting machines (GBM) with a Tweedie loss function to build a predictive model for loss cost.
Trust me, GBMs are interesting and work very well for insurance pricing data…
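Production models typically use a library such as LightGBM or XGBoost with a built-in Tweedie objective, but the core idea fits in a short script. The sketch below is a minimal from-scratch illustration, not a production implementation: synthetic data, depth-1 trees, and invented parameter values, boosting on the log scale with Newton steps derived from the Tweedie deviance for \( 1 < p < 2 \).

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000

# Synthetic portfolio (invented): claim frequency rises with the feature x
x = rng.uniform(0.0, 1.0, n)
counts = rng.poisson(np.exp(-1.0 + 2.0 * x))
y = np.array([rng.gamma(2.0, 300.0, k).sum() for k in counts])  # loss costs

p = 1.5  # Tweedie variance power; 1 < p < 2 gives compound Poisson-gamma

def tweedie_deviance(y, mu):
    """Mean Tweedie deviance for 1 < p < 2 (lower is better)."""
    return 2.0 * np.mean(y ** (2 - p) / ((1 - p) * (2 - p))
                         - y * mu ** (1 - p) / (1 - p)
                         + mu ** (2 - p) / (2 - p))

def fit_stump(x, g, h):
    """Depth-1 tree with Newton-style leaf values (gradient g, Hessian h)."""
    best_gain, best = -np.inf, None
    for t in np.quantile(x, np.linspace(0.1, 0.9, 9)):
        m = x <= t
        GL, HL = g[m].sum(), h[m].sum()
        GR, HR = g[~m].sum(), h[~m].sum()
        gain = GL ** 2 / HL + GR ** 2 / HR
        if gain > best_gain:
            best_gain, best = gain, (t, GL / HL, GR / HR)
    return best

# Boost on the log scale, F = log(mu), starting from the overall mean
F = np.full(n, np.log(y.mean()))
for _ in range(100):
    mu = np.exp(F)
    g = y * mu ** (1 - p) - mu ** (2 - p)                      # negative gradient
    h = (p - 1) * y * mu ** (1 - p) + (2 - p) * mu ** (2 - p)  # Hessian
    t, v_lo, v_hi = fit_stump(x, g, h)
    F += 0.1 * np.where(x <= t, v_lo, v_hi)

print(tweedie_deviance(y, np.exp(F)) < tweedie_deviance(y, np.full(n, y.mean())))
```

The constant prediction \( \bar{y} \) is the best possible single-number model under Tweedie deviance, so beating it confirms the boosting loop is learning the frequency signal in `x`.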
The GBM can provide us with an assessment of variable importance: the variables whose changes in value have the largest impact on the model's predictions.
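One common, model-agnostic way to measure variable importance is permutation importance: shuffle one feature's values and record how much the model's error worsens. A minimal sketch on a toy model (the data and `predict` function are invented; a fitted GBM's predict function would slot in the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy data: the outcome depends on feature 0 but not on feature 1
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)

def predict(X):
    """Stand-in for a fitted model; here we simply know the true function."""
    return 3.0 * X[:, 0]

def permutation_importance(predict, X, y, j, rng):
    """Increase in MSE when column j is shuffled (higher = more important)."""
    base = np.mean((y - predict(X)) ** 2)
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return np.mean((y - predict(Xp)) ** 2) - base

imp = [permutation_importance(predict, X, y, j, rng) for j in (0, 1)]
print(imp[0] > imp[1])  # feature 0 drives the predictions; feature 1 does not
```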
The following slides show the marginal effects of each predictor variable on predicted loss cost.
Data science is a rapidly changing field. In addition to fitting statistical and machine learning models, a data scientist must be familiar with,